
Simulation-Based Approximate Dynamic Programming

In the previous chapter, we developed the projection method framework for solving functional equations. We saw how to transform the infinite-dimensional problem of finding $v$ such that $\mathrm{L}v = v$ into a finite-dimensional one by choosing basis functions and projection conditions (least squares, Galerkin, collocation). The main idea was to make the residual $R(s) = \mathrm{L}v(s) - v(s)$ “small” according to some criterion—whether by minimizing its squared norm, requiring orthogonality to basis functions, or forcing it to vanish at specific collocation points.

However, we left one critical question unresolved: how do we actually evaluate the Bellman operator at a state $s$? Recall that applying $\mathrm{L}$ requires computing an expectation:

$$(\mathrm{L} v)(s) = \max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \int v(s')\,p(ds'|s,a)\right\}$$

In our discussion of projection methods, we remained agnostic about numerical integration—we didn’t commit to a specific technique for evaluating this integral. When the state space is discrete and small, this expectation is simply a finite sum we can compute exactly. When the state space is continuous or very large, we face a computational challenge.

This chapter addresses that challenge by introducing Monte Carlo integration as our method for approximating expectations. Rather than requiring explicit knowledge of the transition probability function $p(ds'|s,a)$ and using deterministic quadrature rules, we will work with samples—observed next states drawn from the transition dynamics. This is the defining feature of what the operations research community calls simulation-based approximate dynamic programming: the combination of projection methods (to handle infinite-dimensional function spaces) with Monte Carlo integration (to handle intractable expectations via sampling).

This same combination is precisely what the machine learning community recognizes as reinforcement learning. By relying on samples rather than exact probabilities, we move from planning with a known model to learning from data. The algorithms we develop will work in settings where we can simulate or interact with the system but may not have access to—or may not want to use—explicit transition probabilities. This is the bridge from the model-based dynamic programming of earlier chapters to the data-driven learning setting that defines modern RL.

Discretization and Numerical Quadrature

Recall that the Bellman operator for continuous state spaces takes the form:

$$(\mathrm{L} v)(s) \equiv \max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \int v(s')\,p(ds'|s,a)\right\}, \quad \forall s \in \mathcal{S}$$

To make this computationally tractable for continuous state spaces, we can discretize the state space and approximate the integral over this discretized space. While this allows us to evaluate the Bellman operator componentwise, we must first decide how to represent value functions in this discretized setting.

When working with discretized representations, we partition the state space into $N_s$ cells with centers at grid points $\{s_i\}_{i=1}^{N_s}$. We then work with value functions that are piecewise constant on each cell: i.e., for any $s \in \mathcal{S}$, $v(s) = v(s_{k(s)})$, where $k(s)$ is the index of the cell containing $s$. We denote the discretized reward function by $r_h(s, a) \equiv r(s_{k(s)}, a)$.

For transition probabilities, we need to be more careful. While we similarly map any state-action pair to its corresponding cell, we must ensure that integrating over the discretized transition function yields a valid probability distribution. We achieve this by normalizing:

$$p_h(s'|s, a) \equiv \frac{p(s_{k(s')}|s_{k(s)}, a)}{\int p(s_{k(s')}|s_{k(s)}, a)\, ds'}$$

After defining our discretized reward and transition probability functions, we can write down our discretized Bellman operator. We start with the Bellman operator using our discretized functions $r_h$ and $p_h$. While these functions map to grid points, they’re still defined over continuous spaces - we haven’t yet dealt with the computational challenge of the integral. With this discretization approach, the value function is piecewise constant over cells. This lets us express the integral as a sum over cells, where each cell’s contribution is the probability of transitioning to that cell multiplied by the value at that cell’s grid point:

$$\begin{aligned} (\widehat{\mathrm{L}}_h v)(s) &= \max_{k=1,\ldots,N_a} \left\{r_h(s, a_k) + \gamma \int v(s')\,p_h(s'|s, a_k)\,ds'\right\} \\ &= \max_{k=1,\ldots,N_a} \left\{r_h(s, a_k) + \gamma \int v(s_{k(s')})\,p_h(s'|s, a_k)\,ds'\right\} \\ &= \max_{k=1,\ldots,N_a} \left\{r_h(s, a_k) + \gamma \sum_{i=1}^{N_s} v(s_i) \int_{\text{cell}_i} p_h(s'|s, a_k)\,ds'\right\} \\ &= \max_{k=1,\ldots,N_a} \left\{r_h(s, a_k) + \gamma \sum_{i=1}^{N_s} v(s_i)\,p_h(s_i|s_{k(s)}, a_k)\right\} \end{aligned}$$

This form makes clear how discretization converts our continuous-space problem into a finite computation: we’ve replaced integration over continuous space with summation over grid points. The price we pay is that the number of terms in our sum grows exponentially with the dimension of our state space - the familiar curse of dimensionality.
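As an illustration, the following sketch runs value iteration with this cell-sum form of the operator on a hypothetical 1-D MDP. The grid, transition kernel, and rewards are all invented for the example; only the structure (replace the integral by a sum over cells) comes from the derivation above.

```python
# Hypothetical 1-D MDP on [0, 1): Ns cells with centers s_i, Na actions,
# a made-up normalized transition matrix P[a][i][j], invented rewards.
Ns, Na, gamma = 10, 2, 0.9
grid = [(i + 0.5) / Ns for i in range(Ns)]          # cell centers s_i

def make_p(a):
    # toy kernel: mass concentrated around a drift of +/- one cell, normalized
    P = []
    for i in range(Ns):
        w = [1.0 / (1 + abs(j - i - (1 if a else -1))) for j in range(Ns)]
        z = sum(w)
        P.append([x / z for x in w])                 # each row sums to 1
    return P

P = [make_p(a) for a in range(Na)]
r = [[grid[i] * (1.0 if a else -1.0) for a in range(Na)] for i in range(Ns)]

v = [0.0] * Ns
for _ in range(200):                                 # value-iteration sweeps
    v = [max(r[i][a] + gamma * sum(P[a][i][j] * v[j] for j in range(Ns))
             for a in range(Na))
         for i in range(Ns)]
```

After enough sweeps, `v` is (numerically) a fixed point of the discretized operator.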

Monte Carlo Integration

Numerical quadrature methods scale poorly with increasing dimension. Specifically, for a fixed error tolerance $\epsilon$, the number of required quadrature points grows exponentially with dimension $d$ as $O\left(\left(\frac{1}{\epsilon}\right)^d\right)$. Furthermore, quadrature methods require explicit evaluation of the transition probability function $p(s'|s,a)$ at specified points—a luxury we don’t have in the “model-free” setting where we only have access to samples from the MDP.

Let $\mathcal{B} = \{s_1, \ldots, s_M\}$ be our set of base points where we will evaluate the operator. At each base point $s_k \in \mathcal{B}$, Monte Carlo integration approximates the expectation using $N$ samples:

$$\int v_n(s')\,p(ds'|s_k,a) \approx \frac{1}{N} \sum_{i=1}^N v_n(s'_{k,i}), \quad s'_{k,i} \sim p(\cdot|s_k,a)$$

where $s'_{k,i}$ denotes the $i$-th sample drawn from $p(\cdot|s_k,a)$ for base point $s_k$. This approach has two properties that make it particularly attractive for high-dimensional problems and model-free settings:

  1. The convergence rate is $O\left(\frac{1}{\sqrt{N}}\right)$ regardless of the number of dimensions

  2. It only requires samples from $p(\cdot|s_k,a)$, not explicit probability values
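To make these two properties concrete, here is a minimal sketch with an assumed toy integrand $v(s') = s'^2$ and standard normal transition noise (so the true expectation is 1): the estimator touches only samples, never the density, and its error shrinks like $1/\sqrt{N}$.

```python
import random

# Monte Carlo estimate of E[v(s')] for s' ~ Normal(0, 1) with v(s) = s^2
# (an invented example; the true value is 1).  Only draws are needed.
random.seed(0)

def mc_estimate(N):
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(N)) / N

err_small = abs(mc_estimate(100) - 1.0)      # ~0.1 scale error
err_large = abs(mc_estimate(10_000) - 1.0)   # ~0.01 scale error
```

The same code would work unchanged if the samples came from a high-dimensional simulator, which is the point of the dimension-independence claim.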

Using this approximation of the expected value over the next state, we can define a new “empirical” Bellman optimality operator:

$$(\widehat{\mathrm{L}}_N v)(s_k) \equiv \max_{a \in \mathcal{A}_{s_k}} \left\{r(s_k,a) + \frac{\gamma}{N} \sum_{i=1}^N v(s'_{k,i})\right\}, \quad s'_{k,i} \sim p(\cdot|s_k,a)$$

for each $s_k \in \mathcal{B}$. A direct adaptation of the successive approximation method to this empirical operator yields a Monte Carlo variant of value iteration.
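A minimal sketch of such a Monte Carlo value iteration loop, on a hypothetical two-state problem where `sample_next` stands in for the simulator (dynamics and rewards invented for illustration):

```python
import random

random.seed(1)
gamma, N = 0.9, 50
states, actions = (0, 1), (0, 1)

def sample_next(s, a):
    # simulator access only: draw s' ~ p(.|s, a); action a "succeeds" w.p. 0.8
    return a if random.random() < 0.8 else 1 - a

def reward(s, a):
    return 1.0 if (s == 1 and a == 1) else 0.0

v = {s: 0.0 for s in states}
for _ in range(100):
    # empirical Bellman operator: a sample mean replaces the expectation
    v = {s: max(reward(s, a)
                + gamma / N * sum(v[sample_next(s, a)] for _ in range(N))
                for a in actions)
         for s in states}
```

Note that no transition probabilities appear anywhere; the loop only calls the simulator.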

Note that the original error bound derived as a termination criterion for value iteration need not hold in this approximate setting. Hence, we use a generic termination criterion based on computational budget and desired tolerance. While this aspect could be improved, we’ll focus on a more pressing matter: the algorithm’s tendency to produce upwardly biased values. In other words, this algorithm “thinks” the world is more rosy than it actually is - it overestimates values.

Overestimation Bias in Monte Carlo Value Iteration

In statistics, bias refers to a systematic error where an estimator consistently deviates from the true parameter value. For an estimator $\hat{\theta}$ of a parameter $\theta$, we define bias as: $\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$. While bias isn’t always problematic — sometimes we deliberately introduce bias to reduce variance, as in ridge regression — uncontrolled bias can lead to significantly distorted results. In the context of value iteration, this distortion gets amplified even more due to the recursive nature of the algorithm.

Consider how the Bellman operator works in value iteration. At iteration $i$, we have a value function estimate $v_i(s)$ and aim to improve it by applying the Bellman operator $\mathrm{L}$. The ideal update would be:

$$(\mathrm{L}v_i)(s) = \max_{a \in \mathcal{A}(s)} \left\{ r(s,a) + \gamma \int v_i(s')\, p(ds'|s,a) \right\}$$

However, we can’t compute this integral exactly and use Monte Carlo integration instead, drawing $N$ next-state samples for each state and action pair. The bias emerges when we take the maximum over actions:

$$(\widehat{\mathrm{L}}v_i)(s) = \max_{a \in \mathcal{A}(s)} \hat{q}_i(s,a), \quad \text{where} \quad \hat{q}_i(s,a) \equiv r(s,a) + \frac{\gamma}{N} \sum_{j=1}^N v_i(s'_j), \quad s'_j \sim p(\cdot|s,a)$$

While the Monte Carlo estimate $\hat{q}_i(s,a)$ is unbiased for any individual action, the empirical Bellman operator is biased upward due to Jensen’s inequality, which states that for any convex function $f$, we have $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$. Since the maximum operator is convex, this implies:

$$\mathbb{E}[(\widehat{\mathrm{L}}v_i)(s)] = \mathbb{E}\left[\max_{a \in \mathcal{A}(s)} \hat{q}_i(s,a)\right] \geq \max_{a \in \mathcal{A}(s)} \mathbb{E}[\hat{q}_i(s,a)] = (\mathrm{L}v_i)(s)$$

This means that our Monte Carlo approximation of the Bellman operator is biased upward:

$$b_i(s) = \mathbb{E}[(\widehat{\mathrm{L}}v_i)(s)] - (\mathrm{L}v_i)(s) \geq 0$$

Even worse, this bias compounds through iterations as each new value function estimate $v_{i+1}$ is based on targets generated by the biased operator $\widehat{\mathrm{L}}$, creating a nested structure of bias accumulation. This bias remains nonnegative at every step, and each application of the Bellman operator potentially adds more upward bias. As a result, instead of converging to the true value function $v^*$, the algorithm typically stabilizes at a biased approximation that systematically overestimates true values.
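The following small simulation (synthetic numbers: two actions with identical true value 0, each estimated from $N$ noisy samples) makes the Jensen gap visible. The maximum of the true means is 0, yet the expected maximum of the sample means is strictly positive:

```python
import random, statistics

random.seed(0)
N, trials = 10, 10_000

def sample_mean():
    # unbiased Monte Carlo estimate of a single action's value (true value 0)
    return sum(random.gauss(0.0, 1.0) for _ in range(N)) / N

# average over many trials of max(q_hat(a1), q_hat(a2)): the upward bias
gap = statistics.mean(max(sample_mean(), sample_mean()) for _ in range(trials))
```

For two independent $\mathcal{N}(0, 1/N)$ estimates the gap is $\sigma/\sqrt{\pi} \approx 0.18$ here, consistent with the simulation.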

The Keane-Wolpin Bias Correction Algorithm

Keane and Wolpin proposed to de-bias such estimators by essentially “learning” the bias, then subtracting it when computing the empirical Bellman operator. If we knew this bias function, we could subtract it from our empirical estimate to get an unbiased estimate of the true Bellman operator:

$$(\widehat{\mathrm{L}}v_i)(s) - \text{bias}(s) = (\widehat{\mathrm{L}}v_i)(s) - \left(\mathbb{E}[(\widehat{\mathrm{L}}v_i)(s)] - (\mathrm{L}v_i)(s)\right) \approx (\mathrm{L}v_i)(s)$$

This equality holds in expectation, though any individual estimate would still have variance around the true value.

So how can we estimate the bias function? The Keane-Wolpin approach manages this using an important fact from extreme value theory: for normal random variables, the difference between the expected maximum and the maximum of expectations scales with the standard deviation:

$$\mathbb{E}\left[\max_{a \in \mathcal{A}} \hat{q}_i(s,a)\right] - \max_{a \in \mathcal{A}} \mathbb{E}[\hat{q}_i(s,a)] \approx c \cdot \sqrt{\max_{a \in \mathcal{A}} \text{Var}_i(s,a)}$$

The variance term $\max_{a \in \mathcal{A}} \text{Var}_i(s,a)$ will typically be dominated by the action with the largest value -- the greedy action $a^*_i(s)$. Rather than deriving the constant $c$ theoretically, Keane-Wolpin proposed learning the relationship between variance and bias empirically through these steps:

  1. Select a small set of “benchmark” states (typically 20-50) that span the state space

  2. For these states, compute more accurate value estimates using many more Monte Carlo samples (10-100x more than usual)

  3. Compute the empirical bias at each benchmark state $s$:

    $$\hat{b}_i(s) = (\widehat{\mathrm{L}}v_i)(s) - (\widehat{\mathrm{L}}_{\text{accurate}}v_i)(s)$$
  4. Fit a linear relationship between this bias and the variance at the greedy action:

    $$\hat{b}_i(s) = \alpha_i \cdot \text{Var}_i(s,a^*_i(s)) + \epsilon$$

This creates a dataset of pairs $(\text{Var}_i(s,a^*_i(s)), \hat{b}_i(s))$ that can be used to estimate $\alpha_i$ through ordinary least squares regression. Once we have learned this bias function $\hat{b}$, we can define the bias-corrected Bellman operator:

$$(\widetilde{\mathrm{L}}v_i)(s) \triangleq (\widehat{\mathrm{L}}v_i)(s) - \hat{b}(s)$$
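The regression step can be sketched as follows. The data here is synthetic (a made-up `true_alpha` and noise level standing in for measurements at benchmark states), so this illustrates only the mechanics: fit $\alpha$ by least squares through the origin, then subtract the predicted bias elsewhere.

```python
import random

random.seed(0)
true_alpha = 0.5                          # assumed ground truth for the demo
pairs = []
for _ in range(40):                       # ~40 benchmark states
    var = random.uniform(0.5, 5.0)        # Var_i(s, a*_i(s)) at a benchmark state
    bias = true_alpha * var + random.gauss(0.0, 0.1)   # noisy bias measurement
    pairs.append((var, bias))

# OLS through the origin for the model  b = alpha * var + eps
alpha_hat = sum(v * b for v, b in pairs) / sum(v * v for v, b in pairs)

def corrected(L_hat_value, var):
    # bias-corrected empirical Bellman value at a non-benchmark state
    return L_hat_value - alpha_hat * var
```

In the actual algorithm, `bias` would come from comparing the cheap and the sample-rich operator estimates at the benchmark states rather than from a synthetic model.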

While this bias correction approach has been influential in econometrics, it hasn’t gained much traction in the machine learning community. A major drawback is the need for accurate operator estimation at benchmark states, which requires allocating substantially more samples to these states. In the next section, we’ll explore an alternative strategy that, while requiring the maintenance of two sets of value estimates, achieves bias correction without demanding additional samples.

Decoupling Selection and Evaluation

A simpler approach to addressing the upward bias is to maintain two separate Q-function estimates - one for action selection and another for evaluation. Let’s first look at the corresponding Monte Carlo value iteration algorithm and then convince ourselves mathematically that this is a good idea. Assume a Monte Carlo integration setup over Q-factors:

In this algorithm, we maintain two separate Q-functions ($q^A$ and $q^B$) and use them asymmetrically: when updating $q^A$, we use $q^A$ to select the best action ($a^*_i = \arg\max_{a'} q^A_i(s'_j,a')$) but then evaluate that action using $q^B$’s estimate ($q^B_i(s'_j,a^*_i)$). We do the opposite when updating $q^B$. You can see this separation in steps 3.2.2 and 3.2.3 of the algorithm, where for each update we first use one Q-function to pick the action and then plug that chosen action into the other for evaluation. We will see that this decomposition helps mitigate the positive bias that occurs due to Jensen’s inequality.

An HVAC analogy

Consider a building where each HVAC unit ii has some true maximum power draw μi\mu_i under worst-case conditions. Let’s pretend that we don’t have access to manufacturer datasheets, so we need to estimate these maxima from actual measurements. Now the challenge is that power draw fluctuates with environmental conditions. If we use a single day’s measurements and look at the highest power draw, we systematically overestimate the true maximum draw across all units.

To see this, let $X_A^i$ be unit $i$’s power draw on day A and $X_B^i$ be its power draw on day B. While both measurements are unbiased, $\mathbb{E}[X_A^i] = \mathbb{E}[X_B^i] = \mu_i$, their maximum is not, due to Jensen’s inequality:

$$\mathbb{E}[\max_i X_A^i] \geq \max_i \mathbb{E}[X_A^i] = \max_i \mu_i$$

Intuitively, this problem occurs because the highest reading tends to come from a unit that experienced particularly demanding conditions (e.g., direct sunlight, full occupancy, peak humidity) rather than from the unit with the highest true maximum draw. To estimate the true maximum power draw more accurately, we use the following measurement protocol:

  1. Use day A measurements to select which unit hit the highest peak

  2. Use day B measurements to evaluate that unit’s power consumption

This yields the estimator:

$$Y = X_B^{\arg\max_i X_A^i}$$

We can show that by decoupling selection and evaluation in this fashion, our estimator $Y$ will no longer systematically overestimate the true maximum draw. First, observe that $\arg\max_i X_A^i$ is a random variable (call it $J$) - it tells us which unit had the highest power draw on day A. It has some probability distribution based on day A’s conditions: $P(J = j) = P(\arg\max_i X_A^i = j)$. Using the law of total expectation:

$$\begin{align*} \mathbb{E}[Y] = \mathbb{E}[X_B^J] &= \mathbb{E}[\mathbb{E}[X_B^J \mid J]] \quad \text{(by the tower property)} \\ &= \sum_{j=1}^n \mathbb{E}[X_B^j \mid J = j]\, P(J = j) \\ &= \sum_{j=1}^n \mathbb{E}[X_B^j \mid \arg\max_i X_A^i = j]\, P(\arg\max_i X_A^i = j) \end{align*}$$

Now we need to make an important observation: unit $j$’s power draw on day B ($X_B^j$) is independent of whether it had the highest reading on day A ($\{\arg\max_i X_A^i = j\}$). An extreme cold event on day A shouldn’t affect day B’s readings (especially in Quebec, where the weather tends to vary widely from day to day). Therefore:

$$\mathbb{E}[X_B^j \mid \arg\max_i X_A^i = j] = \mathbb{E}[X_B^j] = \mu_j$$

This tells us that the two-day estimator is now an average of the true underlying power consumptions:

$$\mathbb{E}[Y] = \sum_{j=1}^n \mu_j\, P(\arg\max_i X_A^i = j)$$

To analyze $\mathbb{E}[Y]$ more closely, let’s use a general result: if we have a real-valued function $f$ defined on a discrete set of units $\{1, \dots, n\}$ and a probability distribution $q(\cdot)$ over these units, then the maximum value of $f$ across all units is at least as large as the weighted sum of $f$ values with weights $q$. Formally,

$$\max_{j \in \{1, \dots, n\}} f(j) \geq \sum_{j=1}^n q(j)\, f(j).$$

Applying this to our setting, we set $f(j) = \mu_j$ (the true maximum power draw for unit $j$) and $q(j) = P(J = j)$ (the probability that unit $j$ achieves the maximum reading on day A). This gives us:

$$\max_{j \in \{1, \dots, n\}} \mu_j \geq \sum_{j=1}^n P(J = j)\, \mu_j = \mathbb{E}[Y].$$

Therefore, the expected value of $Y$ (our estimator) will always be less than or equal to the true maximum value $\max_j \mu_j$. In other words, $Y$ provides a conservative estimate of the true maximum: it tends not to overestimate $\max_j \mu_j$ but instead approximates it as closely as possible without systematic upward bias.
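A quick simulation confirms both facts. The true maxima `mu` and the unit-variance Gaussian daily fluctuations are made up for the example; the single-day maximum overshoots $\max_j \mu_j$, while the decoupled estimator stays at or below it.

```python
import random, statistics

random.seed(0)
mu = [3.0, 4.0, 5.0, 4.5, 3.5]           # true maximum draws; max is 5.0
noise = 1.0                               # day-to-day fluctuation scale

def day():
    # one day of readings: true maximum plus independent Gaussian noise
    return [m + random.gauss(0.0, noise) for m in mu]

naive, decoupled = [], []
for _ in range(20_000):
    a, b = day(), day()
    naive.append(max(a))                              # single-day max: biased up
    j = max(range(len(mu)), key=lambda i: a[i])       # select on day A
    decoupled.append(b[j])                            # evaluate on day B

naive_mean = statistics.mean(naive)
decoupled_mean = statistics.mean(decoupled)
```

The decoupled average lands near $\sum_j \mu_j P(J=j)$, a convex combination of the true maxima, exactly as the derivation predicts.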

Consistency

Even though $Y$ is not an unbiased estimator of $\max_j \mu_j$ (since $\mathbb{E}[Y] \leq \max_j \mu_j$), it is consistent. As more independent days (or measurements) are observed, the selection-evaluation procedure becomes more effective at isolating the intrinsic maximum, reducing the influence of day-specific environmental fluctuations. Over time, this approach yields a stable and increasingly accurate approximation of $\max_j \mu_j$.

To show that $Y$ is a consistent estimator of $\max_i \mu_i$, we want to demonstrate that as the number of independent measurements (days, in this case) increases, $Y$ converges in probability to $\max_i \mu_i$. Let’s suppose we have $m$ independent days of measurements for each unit, denote by $X_A^{(k),i}$ the power draw of unit $i$ on the $k$-th selection day, and let $J_m$ be the unit with the highest power draw averaged over these $m$ days.

The estimator we construct is: $Y_m = X_B^{(J_m)}$, where $X_B^{(J_m)}$ is the power draw of the selected unit $J_m$ on an independent evaluation day $B$. We will now show that $Y_m$ converges to $\max_i \mu_i$ as $m \to \infty$. This involves two main steps:

  1. Consistency of the Selection Step $J_m$: As $m \to \infty$, the unit selected by $J_m$ will tend to be the one with the true maximum power draw $\max_i \mu_i$.

  2. Convergence of $Y_m$ to $\mu_{J_m}$: Since the evaluation day $B$ measurement $X_B^{(J_m)}$ is unbiased with expectation $\mu_{J_m}$, as $m \to \infty$, $Y_m$ will converge to $\mu_{J_m}$, which in turn converges to $\max_i \mu_i$.

The average power draw over $m$ days for each unit $i$ is:

$$\frac{1}{m} \sum_{k=1}^m X_A^{(k),i}.$$

By the law of large numbers, as $m \to \infty$, this sample average converges to the true expected power draw $\mu_i$ for each unit $i$:

$$\frac{1}{m} \sum_{k=1}^m X_A^{(k),i} \xrightarrow{m \to \infty} \mu_i.$$

Since $J_m$ selects the unit with the highest sample average, in the limit $J_m$ will almost surely select the unit with the highest true mean, $\max_i \mu_i$. Thus, as $m \to \infty$,

$$\mu_{J_m} \to \max_i \mu_i.$$

Given that $J_m$ identifies the unit with the maximum true mean power draw in the limit, we now look at $Y_m = X_B^{(J_m)}$, the power draw of unit $J_m$ on the independent evaluation day $B$.

Since $X_B^{(J_m)}$ is an unbiased estimator of $\mu_{J_m}$, we have:

$$\mathbb{E}[Y_m \mid J_m] = \mu_{J_m}.$$

As $m \to \infty$, $\mu_{J_m}$ converges to $\max_i \mu_i$. Thus, $Y_m$ also converges in probability to $\max_i \mu_i$, because $Y_m$ is centered around $\mu_{J_m}$ and $J_m$ converges to the index of the unit achieving $\max_i \mu_i$.

Combining these two steps, we conclude that:

$$Y_m \xrightarrow{m \to \infty} \max_i \mu_i \text{ in probability}.$$

This establishes the consistency of $Y$ as an estimator of $\max_i \mu_i$: as the number of independent measurements grows, $Y_m$ converges to the true maximum power draw $\max_i \mu_i$.

Parametric Dynamic Programming

We have so far considered a specific kind of approximation: that of the Bellman operator itself. We explored a modified version of the operator with the desirable property of smoothness, which we deemed beneficial for optimization purposes and due to its rich multifaceted interpretations. We now turn our attention to another form of approximation, complementary to the previous kind, which seeks to address the challenge of applying the operator across the entire state space.

To be precise, suppose we can compute the Bellman operator $\mathrm{L}v$ at some state $s$, producing a new function $u$ whose value at state $s$ is $u(s) = (\mathrm{L}v)(s)$. Then, putting aside the problem of pointwise evaluation, we want to carry out this update across the entire domain of $v$. When working with small state spaces, this is not an issue, and we can afford to carry out the update across the entirety of the state space. However, for larger or infinite state spaces, this becomes a major challenge.

So what can we do? Our approach will be to compute the operator at chosen “grid points,” then “fill in the blanks” for the states where we haven’t carried out the update by “fitting” the resulting output function on a dataset of input-output pairs. The intuition is that for sufficiently well-behaved functions and sufficiently expressive function approximators, we hope to generalize well enough. Our community calls this “learning,” while others would call it “function approximation” — a field of its own in mathematics. To truly have a “learning algorithm,” we’ll need to add one more piece of machinery: the use of samples — of simulation — to pick the grid points and perform numerical integration. But this is for the next section...

Partial Updates in the Tabular Case

The ideas presented in this section apply more broadly to the successive approximation method for a fixed-point problem. Consider again the problem of finding the optimal value function $v_\gamma^\star$ as the fixed point of the Bellman optimality operator $\mathrm{L}$:

$$\mathrm{L} \mathbf{v} \equiv \max_{\pi \in \Pi^{MD}} \left\{\mathbf{r}_\pi + \gamma \mathbf{P}_\pi \mathbf{v}\right\}$$

Value iteration -- the name for the method of successive approximation applied to $\mathrm{L}$ -- computes a sequence of iterates $v_{n+1} = \mathrm{L}v_n$ from some arbitrary $v_0$. Let’s pause to consider what the equality sign in this expression means: it represents an assignment (perhaps better denoted as $:=$) across the entire domain. This becomes clearer when writing the update in component form:

$$v_{n+1}(s) := (\mathrm{L} v_n)(s) \equiv \max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v_n(j)\right\}, \quad \forall s \in \mathcal{S}$$

Pay particular attention to the $\forall s \in \mathcal{S}$ notation: what happens when we can’t afford to update all components in each step of value iteration? A potential solution is to use Gauss-Seidel Value Iteration, which updates states sequentially, immediately using fresh values for subsequent updates.

The Gauss-Seidel value iteration approach offers several advantages over standard value iteration: it can be more memory-efficient and often leads to faster convergence. This idea generalizes further (see for example Bertsekas (1983)) to accommodate fully asynchronous updates in any order. However, these methods, while more flexible in their update patterns, still fundamentally rely on a tabular representation—that is, they require storing and eventually updating a separate value for each state in memory. Even if we update states one at a time or in blocks, we must maintain this complete table of values, and our convergence guarantee assumes that every entry in this table will eventually be revised.
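The in-place sweep can be sketched as follows, on an invented 3-state chain: each update immediately reuses freshly written entries, so only a single value array is kept.

```python
# Gauss-Seidel value iteration on a toy 3-state, 2-action MDP
# (transition rows p[a][s] and rewards r[s][a] made up for illustration).
gamma = 0.9
p = [
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # action 0: stay
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],   # action 1: move right
]
r = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]                   # reward in state 2

v = [0.0, 0.0, 0.0]
for _ in range(300):
    for s in range(3):          # in-place sweep: v[j] may already be fresh
        v[s] = max(r[s][a] + gamma * sum(p[a][s][j] * v[j] for j in range(3))
                   for a in range(2))
```

For this chain the fixed point has $v(2) = 1/(1-\gamma) = 10$, which the sweep recovers.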

But what if maintaining such a table is impossible? This challenge arises naturally when dealing with continuous state spaces, where we cannot feasibly store values for every possible state, let alone update them. This is where function approximation comes into play.

Parametric Value Iteration

In the parametric approach to dynamic programming, instead of maintaining an explicit table of values, we represent the value function using a parametric function approximator $v(s; \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is a vector of parameters that is adjusted across iterations rather than the entries of a tabular representation. This idea traces back to the inception of dynamic programming and was described as early as 1963 by Bellman himself, who considered polynomial approximations. For a value function $v(s)$, we can write its polynomial approximation as:

$$v(s) \approx \sum_{i=0}^{n} \theta_i \phi_i(s)$$

where $\theta_i$ are the adjustable coefficients and $\phi_i(s)$ are fixed basis functions.

As we discussed earlier in the context of trajectory optimization, we can choose from different polynomial bases beyond the usual monomial basis $\phi_i(s) = s^i$, such as Legendre or Chebyshev polynomials. While polynomials offer attractive mathematical properties, they become challenging to work with in higher dimensions due to the curse of dimensionality. This limitation motivates our later turn to neural network parameterizations, which scale better with dimensionality.
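To make the fitting step concrete, here is a self-contained least-squares fit in the monomial basis $\phi_i(s) = s^i$ (normal equations solved by a small Gaussian elimination). This is purely illustrative; for high degrees, orthogonal bases such as Chebyshev behave far better numerically.

```python
def fit_poly(xs, ys, degree):
    """Least-squares fit of sum_i theta_i * x**i via the normal equations."""
    k = degree + 1
    # (Phi^T Phi) theta = Phi^T y
    A = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    for col in range(k):                          # elimination with pivoting
        piv = max(range(col, k), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, k):
            f = A[row][col] / A[col][col]
            for c in range(col, k):
                A[row][c] -= f * A[col][c]
            b[row] -= f * b[col]
    theta = [0.0] * k
    for i in reversed(range(k)):                  # back substitution
        theta[i] = (b[i] - sum(A[i][j] * theta[j] for j in range(i + 1, k))) / A[i][i]
    return theta

xs = [i / 10 for i in range(11)]
theta = fit_poly(xs, [1 + 2 * x + 3 * x * x for x in xs], degree=2)
```

Because the targets here are exactly quadratic, the fit recovers the coefficients $(1, 2, 3)$ up to floating-point error.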

Given a parameterization, our value iteration procedure must now update the parameters $\boldsymbol{\theta}$ rather than tabular values directly. At each iteration, we aim to find parameters that best approximate the Bellman operator’s output at chosen base points. More precisely, we collect a dataset:

$$\mathcal{D}_n = \{(s_i, (\mathrm{L}v)(s_i; \boldsymbol{\theta}_n)) \mid s_i \in B\}$$

and fit a regressor $v(\cdot; \boldsymbol{\theta}_{n+1})$ to this data.

This process differs from standard supervised learning in a specific way: rather than working with a fixed dataset, we iteratively generate our training targets using the previous value function approximation. During this process, the parameters $\boldsymbol{\theta}_n$ remain “frozen”, entering only through dataset creation. This naturally leads to maintaining two sets of parameters: the frozen target parameters $\boldsymbol{\theta}_n$ used to generate the regression targets, and the parameters $\boldsymbol{\theta}_{n+1}$ currently being fitted.

This target model framework emerges naturally from the structure of parametric value iteration — an insight that provides theoretical grounding for modern deep reinforcement learning algorithms, where we commonly hear about the importance of the “target network trick”.

Parametric successive approximation, known in the reinforcement learning literature as Fitted Value Iteration, offers a flexible template for deriving new algorithms by varying the choice of function approximator; various instantiations of this approach have emerged across different fields.

The `fit` function in our algorithm represents this supervised learning step and can be implemented using any standard regression tool that follows the scikit-learn interface. This flexibility in the choice of function approximator allows practitioners to leverage the extensive ecosystem of modern machine learning tools while maintaining the core dynamic programming structure.
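As a sketch of how such a fit step slots into the loop, the example below (toy 1-D dynamics and rewards, all invented) uses a minimal nearest-neighbour regressor with a scikit-learn-flavoured fit/predict interface in place of a real library model:

```python
import random

random.seed(0)
gamma, N = 0.9, 30
base_points = [i / 9 for i in range(10)]

class NearestNeighborRegressor:
    # stand-in for any fit/predict regressor (here: 1-NN lookup on scalars)
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self
    def predict(self, x):
        return self.y[min(range(len(self.X)), key=lambda i: abs(self.X[i] - x))]

def sample_next(s, a):
    # invented dynamics: drift toward the chosen action, plus noise
    return min(1.0, max(0.0, 0.5 * s + 0.5 * a + random.gauss(0.0, 0.05)))

def reward(s, a):
    return -(s - 0.8) ** 2                 # prefer states near 0.8

model = NearestNeighborRegressor().fit(base_points, [0.0] * len(base_points))
for _ in range(40):
    # targets: empirical Bellman backup at the base points under frozen model
    targets = [max(reward(s, a)
                   + gamma / N * sum(model.predict(sample_next(s, a))
                                     for _ in range(N))
                   for a in (0.0, 1.0))
               for s in base_points]
    model = NearestNeighborRegressor().fit(base_points, targets)   # "fit" step
```

Swapping `NearestNeighborRegressor` for any scikit-learn regressor (with array-shaped inputs) changes only the fit call, not the loop.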

The structure of the above algorithm mirrors value iteration in its core idea of iteratively applying the Bellman operator. However, several key modifications distinguish this fitted variant:

First, rather than applying updates across the entire state space, we compute the operator only at selected base points $B$. The resulting values are then stored implicitly through the parameter vector $\boldsymbol{\theta}$ via the fitting step, rather than explicitly as in the tabular case.

The fitting procedure itself may introduce an “inner optimization loop.” For instance, when using neural networks, this involves an iterative gradient descent procedure to optimize the parameters. This creates an interesting parallel with modified policy iteration: just as we might truncate policy evaluation steps there, we can consider variants where this inner loop runs for a fixed number of iterations rather than to convergence.

Finally, the termination criterion from standard value iteration may no longer hold. The classical criterion relied on the sup-norm contractivity property of the Bellman operator — a property that isn’t generally preserved under function approximation. While certain function approximation schemes can maintain this sup-norm contraction property (as we’ll see later), this is the exception rather than the rule.

Parametric Policy Iteration

We can extend this idea of fitting partial operator updates to the policy iteration setting. Remember, policy iteration iterates in the space of policies rather than in the space of value functions. Given an initial guess of a deterministic decision rule $d_0$, we iteratively:

  1. Compute the value function for the current policy (policy evaluation)

  2. Derive a new improved policy (policy improvement)

When computationally feasible and in the model-based setting, we can solve the policy evaluation step directly as a linear system of equations. Alternatively, we could carry out policy evaluation by applying successive approximation to the operator $L_{d_n}$ until convergence, or, as in modified policy iteration, for just a few steps.

To apply the idea of fitting partial updates, we start at the level of the policy evaluation operator $L_{d_n}$. For a given decision rule $d_n$, this operator in component form is:

$$(L_{d_n}v)(s) = r(s,d_n(s)) + \gamma \int v(s')\,p(ds'|s,d_n(s))$$

For a set of base points $B = \{s_1, ..., s_M\}$, we form our dataset:

$$\mathcal{D}_n = \{(s_k, y_k) : s_k \in B\}$$

where:

$$y_k = r(s_k,d_n(s_k)) + \gamma \int v_n(s')\,p(ds'|s_k,d_n(s_k))$$

This gives us a way to perform approximate policy evaluation through function fitting. However, we now face the question of how to perform policy improvement in this parametric setting. The solution comes from the fact that in the exact form of policy iteration, we don’t need to improve the policy everywhere to guarantee progress. In fact, improving the policy at even a single state is sufficient for convergence.

This suggests a natural approach: rather than trying to approximate an improved policy over the entire state space, we can simply:

  1. Compute improved actions at our base points:

$$d_{n+1}(s_k) = \arg\max_{a \in \mathcal{A}} \left\{r(s_k,a) + \gamma \int v_n(s')p(ds'|s_k,a)\right\}, \quad \forall s_k \in B$$

  2. Let the function approximation of the value function implicitly generalize these improvements to other states during the next policy evaluation phase.

This leads to the following algorithm:
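The two steps above, combined with fitted policy evaluation, can be sketched in code. The following is a minimal illustration on a toy two-state MDP with a linear-in-features value function; the MDP, the feature map, and all helper names (`fit`, `base_points`, `P`, `R`) are invented for this example rather than taken from the text:

```python
import numpy as np

# A minimal, illustrative sketch of parametric policy iteration on a toy
# two-state, two-action MDP. The MDP, feature map, and helper names below
# are invented for this example.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition probabilities
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])    # R[s, a]: rewards
gamma = 0.9
base_points = [0, 1]                      # B = {s_1, ..., s_M}

def phi(s):
    return np.array([1.0, float(s)])      # intercept + state-index features

def v(theta, s):
    return phi(s) @ theta

def fit(states, targets):
    # least-squares fit of theta such that v(s) ~ phi(s)^T theta
    A = np.array([phi(s) for s in states])
    return np.linalg.lstsq(A, targets, rcond=None)[0]

theta = np.zeros(2)
d = np.zeros(len(base_points), dtype=int)  # initial decision rule d_0

for n in range(50):
    # Approximate policy evaluation: fit v to y_k = r(s_k, d(s_k)) + gamma E[v(s')]
    y = np.array([R[s, d[k]] + gamma * P[s, d[k]] @ [v(theta, sp) for sp in (0, 1)]
                  for k, s in enumerate(base_points)])
    theta = fit(base_points, y)
    # Policy improvement at the base points only
    d = np.array([int(np.argmax([R[s, a] + gamma * P[s, a] @ [v(theta, sp) for sp in (0, 1)]
                                 for a in (0, 1)]))
                  for s in base_points])
```

Because the two features here span the two-state value space, the fit is exact and the iteration behaves like exact value iteration; with a genuinely restricted approximator the caveats discussed below apply.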

As opposed to exact policy iteration, the iterates of parametric policy iteration need not converge monotonically to the optimal value function. Intuitively, this is because we use function approximation to generalize from base points to the entire state space, which can cause value estimates to improve at the base points while degrading at other states, or lead to interference between updates at different states due to the shared parametric representation.

Q-Factor Representation

As we discussed above, Monte Carlo integration is the method of choice when it comes to approximating the effect of the Bellman operator. This is due to both its computational advantages in higher dimensions and its compatibility with the model-free assumption. However, there is an additional important detail that we have neglected to properly cover: extracting actions from values in a model-free fashion. While we can obtain a value function using the Monte Carlo approach described above, we still face the challenge of extracting an optimal policy from this value function.

More precisely, recall that an optimal decision rule takes the form:

$$d(s) = \arg\max_{a \in \mathcal{A}} \left\{r(s,a) + \gamma \int v(s')p(ds'|s,a)\right\}$$

Therefore, even given an optimal value function $v$, deriving an optimal policy would still require Monte Carlo integration every time we query the decision rule/policy at a state.

An important idea in dynamic programming is that rather than approximating a state-value function, we can instead approximate a state-action value function. These two functions are related: the value function is the expectation of the Q-function (called Q-factors by some authors in the operations research literature) over the conditional distribution of actions given the current state:

$$v(s) = \mathbb{E}[q(s,a)|s]$$

If $q^*$ is an optimal state-action value function, then $v^*(s) = \max_a q^*(s,a)$. Just as we had a Bellman operator for value functions, we can also define an optimality operator for Q-functions. In component form:

$$(\mathrm{L}q)(s,a) = r(s,a) + \gamma \int p(ds'|s,a)\max_{a' \in \mathcal{A}(s')} q(s', a')$$

Furthermore, this operator for Q-functions is also a contraction in the sup-norm and therefore has a unique fixed point $q^*$.

The advantage of iterating over Q-functions rather than value functions is that we can immediately extract optimal actions without having to represent the reward function or transition dynamics directly, nor perform numerical integration. Indeed, an optimal decision rule at state $s$ is obtained as:

$$d(s) = \arg\max_{a \in \mathcal{A}(s)} q(s,a)$$

With this insight, we can adapt our parametric value iteration algorithm to work with Q-functions:

Initialization and Warmstarting

Parametric dynamic programming involves solving a sequence of related optimization problems, one for each fitting procedure at each iteration. While we’ve presented these as independent fitting problems, in practice we can leverage the relationship between successive iterations through careful initialization. This “warmstarting” strategy can significantly impact both computational efficiency and solution quality.

The basic idea is simple: rather than starting each fitting procedure from scratch, we initialize the function approximator with parameters from the previous iteration. This can speed up convergence since successive Q-functions tend to be similar. However, recent work suggests that persistent warmstarting might sometimes be detrimental, potentially leading to a form of overfitting. Alternative “reset” strategies that occasionally reinitialize parameters have shown promise in mitigating this issue.

Here’s how warmstarting can be incorporated into parametric Q-learning with one-step Monte Carlo integration:

The main addition here is the periodic reset of parameters (controlled by frequency $k$), which helps balance the benefits of warmstarting with the need to avoid potential overfitting. When $k=\infty$, we get traditional persistent warmstarting, while $k=1$ corresponds to training from scratch each iteration.

Inner Loop Convergence

Beyond the choice of initialization and whether to chain optimization problems through warmstarting, we can also control how we terminate the inner optimization procedure. In the templates presented above, we implicitly assumed that $\texttt{fit}$ is run to convergence. However, this need not be the case, and different implementations handle this differently.

For example, scikit-learn’s MLPRegressor terminates based on several criteria: when the improvement in loss falls below a tolerance (default tol=1e-4), when it reaches the maximum number of iterations (default max_iter=200), or when the loss fails to improve for n_iter_no_change consecutive epochs. In contrast, ExtraTreesRegressor builds trees deterministically to completion based on its splitting criteria, with termination controlled by parameters like min_samples_split and max_depth.
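To make the flavor of these stopping rules concrete, here is a small pure-Python criterion. The parameter names and defaults echo scikit-learn's, but this is an illustrative sketch of the logic, not the library's actual implementation:

```python
# Illustrative stopping rule in the spirit of MLPRegressor's criteria
# (names and defaults echo scikit-learn's; the real implementation differs).
def should_stop(losses, tol=1e-4, max_iter=200, n_iter_no_change=10):
    """Stop when no epoch in the last n_iter_no_change improved on the best
    earlier loss by more than tol, or when max_iter epochs have elapsed."""
    if len(losses) >= max_iter:
        return True
    if len(losses) <= n_iter_no_change:
        return False
    best_before = min(losses[:-n_iter_no_change])
    recent = losses[-n_iter_no_change:]
    # plateau: none of the recent losses beat the earlier best by more than tol
    return all(l > best_before - tol for l in recent)
```

A training loop would call `should_stop(loss_history)` after each epoch and break out of the inner optimization when it returns `True`.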

The intuition for using early stopping in the inner optimization mirrors that of modified policy iteration in exact dynamic programming. Just as modified policy iteration truncates the Neumann series during policy evaluation rather than solving to convergence, we might only partially optimize our function approximator at each iteration. While this complicates the theoretical analysis, it often works well in practice and can be computationally more efficient.

This perspective helps us understand modern deep reinforcement learning algorithms. For instance, DQN can be viewed as an instance of fitted Q-iteration where the inner optimization is intentionally limited. Here’s how we can formalize this approach:

This formulation makes explicit the two-level optimization structure and allows us to control the trade-off between inner loop optimization accuracy and overall computational efficiency. When $N_{inner}=1$, we recover something closer to DQN's update rule, while larger values of $N_{inner}$ bring us closer to the full fitted Q-iteration approach.

Example Methods

There are several moving parts we can swap in and out when working with parametric dynamic programming - from the function approximator we choose, to how we warm start things, to the specific methods we use for numerical integration and inner optimization. In this section, we’ll look at some concrete examples and see how they fit into this general framework.

Kernel-Based Reinforcement Learning (2002)

Ormoneit and Sen's Kernel-Based Reinforcement Learning (KBRL) Ormoneit & Sen (2002) helped establish the general paradigm of batch reinforcement learning later advocated by Ernst et al. (2005). KBRL is a purely offline method that first collects a fixed set of transitions and then uses kernel regression to solve the optimal control problem through value iteration on this dataset. While the dominant approaches at the time were online methods like temporal difference learning, KBRL showed that another path to developing reinforcement learning algorithms was possible: one capable of leveraging advances in supervised learning to provide both theoretical and practical benefits.

As the name suggests, KBRL uses kernel-based regression within the general framework outlined above.

Step 3 is where KBRL uses kernel regression with a normalized weighting kernel:

$$k_b(x^l_t, x) = \frac{\phi(\|x^l_t - x\|/b)}{\sum_{l'} \phi(\|x^{l'}_t - x\|/b)}$$

where $\phi$ is a kernel function (often Gaussian) and $b$ is the bandwidth parameter. The sum in the denominator runs over the stored sample points, so the weights are normalized to sum to one. Each iteration reuses the entire fixed dataset to re-estimate Q-values through this kernel regression.
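A sketch of this normalized kernel regression step, assuming a Gaussian $\phi$ (the array shapes and function names are illustrative):

```python
import numpy as np

# Sketch of KBRL's normalized kernel regression step with a Gaussian kernel.
def kernel_weights(x_samples, x, b=0.5):
    """Weights over the stored sample points: non-negative, summing to one."""
    dists = np.linalg.norm(x_samples - x, axis=1)
    phi = np.exp(-0.5 * (dists / b) ** 2)   # Gaussian kernel phi(d / b)
    return phi / phi.sum()

def kernel_regress(x_samples, q_values, x, b=0.5):
    """Estimate the Q-value at x as a weighted average of sample Q-values."""
    return kernel_weights(x_samples, x, b) @ q_values
```

Because the weights are non-negative and sum to one, the prediction is always a convex combination of the stored Q-values, a point that matters again when we discuss Gordon's averagers later in the chapter.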

An important theoretical contribution of KBRL is showing that this kernel-based approach ensures convergence of the Q-function sequence. The authors prove that, with appropriate choice of kernel bandwidth decreasing with sample size, the method is consistent - the estimated Q-function converges to the true Q-function as the number of samples grows.

The main practical limitation of KBRL is computational - being a batch method, it requires storing and using all transitions at each iteration, leading to quadratic complexity in the number of samples. The authors acknowledge this limitation for online settings, suggesting that modifications like discarding old samples or summarizing data clusters would be needed for online applications. Ernst’s later work with tree-based methods would help address this limitation while maintaining many of the theoretical advantages of the batch approach.

Ernst’s Fitted Q Iteration (2005)

Ernst's fitted Q iteration Ernst et al. (2005) is a specific instantiation of parametric Q-value iteration that uses extremely randomized trees (Extra-Trees), an extension of random forests proposed by Geurts et al. (2006). This algorithm became particularly well-known, partly because it was one of the first to demonstrate the advantages of offline reinforcement learning in practice on several benchmarks that were challenging at the time.

Random Forests and Extra-Trees differ primarily in how they construct individual trees. Random Forests creates diversity in two ways: it resamples the training data (bootstrap) for each tree, and at each node it randomly selects a subset of features but then searches exhaustively for the best cut-point within each selected feature. In contrast, Extra-Trees uses the full training set for each tree and injects randomization differently: at each node, it randomly selects both features and cut-points without searching for the optimal one. It then picks the best among these completely random splits according to a variance reduction criterion. This double randomization - in both feature and cut-point selection - combined with using the full dataset makes Extra-Trees faster than Random Forests while maintaining similar predictive accuracy.

An important implementation detail concerns how tree structures can be reused across iterations of fitted Q iteration. With parametric methods like neural networks, warmstarting is straightforward - you simply initialize the weights with values from the previous iteration. For decision trees, the situation is more subtle because the model structure is determined by how splits are chosen at each node. When the number of candidate splits per node is $K=1$ (totally randomized trees), the algorithm selects both the splitting variable and threshold purely at random, without looking at the target values (the Q-values we're trying to predict) to evaluate the quality of the split. This means the tree structure only depends on the input variables and random choices, not on what we're predicting. As a result, we can build the trees once in the first iteration and reuse their structure throughout all iterations, only updating the prediction values at the leaves.

Standard Extra-Trees ($K>1$), however, uses target values to choose the best among $K$ random splits by calculating which split best reduces the variance of the predictions. Since these target values change in each iteration of fitted Q iteration (as our estimate of Q evolves), we must rebuild the trees completely. While this is computationally more expensive, it allows the trees to better adapt their structure to capture the evolving Q-function.

The complete algorithm can be formalized as follows:
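In lieu of the formal listing, here is a compact, runnable sketch of the fitted Q iteration loop. Ernst et al. use Extra-Trees as the regressor; a simple nearest-neighbour regressor stands in below so the example is self-contained, and the toy transitions in the test are invented:

```python
import numpy as np

# Sketch of fitted Q iteration on a batch of transitions
# F = {(s_l, a_l, r_l, s'_l)}. Ernst et al. plug in Extra-Trees; a simple
# nearest-neighbour regressor stands in here to keep the sketch self-contained.
class NearestNeighbourRegressor:
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y, float)
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        idx = np.argmin(((X[:, None, :] - self.X[None, :, :]) ** 2).sum(-1), axis=1)
        return self.y[idx]

def fitted_q_iteration(transitions, actions, n_iters=30, gamma=0.9,
                       make_regressor=NearestNeighbourRegressor):
    S, A, R_, Sp = (np.asarray(z, float) for z in zip(*transitions))
    X = np.column_stack([S, A])           # regression inputs (s, a)
    model = None
    for _ in range(n_iters):
        if model is None:
            targets = R_                  # first iteration: Q_1 = r
        else:
            # y_l = r_l + gamma * max_a' Q_n(s'_l, a')
            next_q = np.column_stack(
                [model.predict(np.column_stack([Sp, np.full(len(Sp), a)]))
                 for a in actions])
            targets = R_ + gamma * next_q.max(axis=1)
        model = make_regressor().fit(X, targets)
    return model
```

Any regressor exposing `fit`/`predict` can be swapped in via `make_regressor`, which is exactly the modularity that let Ernst substitute tree ensembles for earlier kernel methods.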

Neural Fitted Q Iteration (2005)

Riedmiller's Neural Fitted Q Iteration (NFQI) Riedmiller (2005) is a natural instantiation of parametric Q-value iteration where:

  1. The function approximator $q(s,a;\boldsymbol{\theta})$ is a multi-layer perceptron

  2. The $\texttt{fit}$ function uses Rprop optimization trained to convergence on each iteration's pattern set

  3. The expected next-state values are estimated through Monte Carlo integration with $N=1$, using the observed next states from transitions

Specifically, rather than using numerical quadrature, which would require known transition probabilities, NFQI approximates the expected future value using observed transitions:

$$\int q_n(s',a')p(ds'|s,a) \approx q_n(s'_{observed},a')$$

where $s'_{observed}$ is the actual next state that was observed after taking action $a$ in state $s$. This is equivalent to Monte Carlo integration with a single sample, making the algorithm fully model-free.

The algorithm follows from the parametric Q-value iteration template:
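The core of the pattern-set construction can be sketched as follows; the function name `nfq_targets`, the `q_fn` callable, and the tuple layout of `transitions` are assumptions of this sketch:

```python
# Sketch of the NFQI pattern-set construction: each regression target uses the
# single observed next state (one-sample Monte Carlo), so no transition model
# is required.
def nfq_targets(transitions, q_fn, actions, gamma=0.95):
    """transitions: iterable of (s, a, r, s_next); q_fn(s, a) -> float.
    Returns inputs (s, a) and targets r + gamma * max_a' q(s_next, a')."""
    inputs, targets = [], []
    for s, a, r, s_next in transitions:
        inputs.append((s, a))
        targets.append(r + gamma * max(q_fn(s_next, ap) for ap in actions))
    return inputs, targets
```

The resulting `(inputs, targets)` pairs form the pattern set that the multi-layer perceptron is then trained on to convergence.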

While NFQI was originally introduced as an offline method with base points collected a priori, the authors also present a variant where base points are collected incrementally. In this online variant, new transitions are gathered using the current policy (greedy with respect to $Q_k$) and added to the experience set. This approach proves particularly useful when random exploration cannot efficiently collect representative experiences.

Deep Q Networks (2013)

DQN Mnih et al. (2013) is a close relative of NFQI - in fact, Riedmiller, the author of NFQI, was also an author on the DQN paper. What at first glance might look like a different algorithm can actually be understood as a special case of parametric dynamic programming with practical adaptations. Let’s build this connection step by step.

First, let’s start with basic parametric Q-value iteration using a neural network:

Next, let’s open up the fit procedure to show the inner optimization loop using gradient descent:

Warmstarting and Partial Fitting

A natural modification is to initialize the inner optimization loop with the previous iteration's parameters - a strategy known as warmstarting - rather than starting from $\boldsymbol{\theta}_0$ each time. Additionally, similar to how modified policy iteration performs partial policy evaluation rather than solving to convergence, we can limit ourselves to a fixed number of optimization steps. These pragmatic changes, when combined, yield:

Flattening the Updates with Target Swapping

Now rather than maintaining two sets of indices for the outer and inner levels, we could also "flatten" this algorithm under a single loop structure using modulo arithmetic. Here's how we could rewrite it:
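The structure of this flattening can be sketched with a toy inner objective standing in for the Bellman error; the quadratic "loss" (pulling the parameters toward `theta_target + 1`) and the learning rate are purely illustrative:

```python
import numpy as np

# Structural sketch of the flattened loop: a single index n, with the target
# parameters synchronized every K steps. The stand-in objective below is
# illustrative only.
def flattened_updates(theta0, K=5, n_steps=20, lr=0.5):
    theta = np.array(theta0, dtype=float)
    theta_target = theta.copy()
    sync_points = []
    for n in range(n_steps):
        if n % K == 0:
            theta_target = theta.copy()   # periodic swap marks the old outer loop
            sync_points.append(n)
        # targets built from theta_target stay fixed for K consecutive steps
        grad = theta - (theta_target + 1.0)
        theta = theta - lr * grad
    return theta, sync_points
```

The modulo test plays the role of the outer-loop boundary: between two synchronizations, every gradient step chases targets computed from the same frozen parameters.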

The flattened version with target parameters achieves exactly the same effect as our previous nested-loop structure with warmstarting and $K$ gradient steps. In the nested version, we would create a dataset using parameters $\boldsymbol{\theta}_n$, then perform $K$ gradient steps to obtain $\boldsymbol{\theta}_{n+1}$. In our flattened version, we maintain a separate $\boldsymbol{\theta}_{target}$ that gets updated every $K$ steps, ensuring that the dataset $\mathcal{D}_n$ is created using the same parameters for $K$ consecutive iterations - just as it would be in the nested version. The only difference is that we've restructured the algorithm to avoid explicitly nesting the loops, making it more suitable for the continuous online training which we are about to introduce. The periodic synchronization of $\boldsymbol{\theta}_{target}$ with the current parameters $\boldsymbol{\theta}_n$ effectively marks the boundary of what would have been the outer loop in our previous version.

Exponential Moving Average Targets

An alternative to this periodic swap of parameters is to use an exponential moving average (EMA) of the parameters:
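The EMA update itself is a one-liner; the function name and the default $\tau$ below are illustrative:

```python
import numpy as np

# Soft (EMA / "Polyak") target update: drag the target slowly toward the
# online parameters instead of swapping them every K steps.
def ema_update(theta_target, theta, tau=0.001):
    return tau * theta + (1.0 - tau) * theta_target
```

Applied after every gradient step, repeated calls make the target an exponentially weighted average of past online parameters.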

Note that the original DQN used the periodic swap of parameters rather than EMA targets. EMA targets (also called "Polyak averaging") started becoming popular in deep RL with DDPG Lillicrap et al. (2015), which used a "soft" target update $\boldsymbol{\theta}_{target} \leftarrow \tau\boldsymbol{\theta} + (1-\tau)\boldsymbol{\theta}_{target}$ with a small $\tau$ (like 0.001). This has since become a common choice in many algorithms like TD3 Fujimoto et al. (2018) and SAC Haarnoja et al. (2018).

Online Data Collection and Experience Replay

Rather than using offline data, we now consider a modification where we incrementally gather samples under our current policy. A common exploration strategy is $\varepsilon$-greedy: with probability $\varepsilon$ we select a random action, and with probability $1-\varepsilon$ we select the greedy action $\arg\max_a q(s,a;\boldsymbol{\theta}_n)$. This ensures we maintain some exploration even as our Q-function estimates improve. Typically $\varepsilon$ is annealed over time, starting with a high value (e.g., 1.0) to encourage early exploration and gradually decreasing to a small final value (e.g., 0.01) to maintain a minimal level of exploration while mostly exploiting our learned policy.
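A sketch of $\varepsilon$-greedy selection with a linear annealing schedule (the schedule constants are illustrative defaults, not prescribed values):

```python
import numpy as np

# Epsilon-greedy action selection with linear annealing of epsilon.
def epsilon_schedule(step, eps_start=1.0, eps_end=0.01, anneal_steps=10_000):
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, eps, rng):
    """q_values: 1-D array of q(s, a; theta) over the actions available in s."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```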

This version faces two practical challenges. First, the transition dataset $\mathcal{T}_n$ grows unbounded over time, creating memory issues. Second, computing gradients over the entire dataset becomes increasingly expensive. These are common challenges in online learning settings, and the standard solutions from supervised learning apply here:

  1. Use a fixed-size circular buffer (often called replay buffer, in reference to “experience replay” by Lin (1992)) to limit memory usage

  2. Compute gradients on mini-batches rather than the full dataset

Here’s how we can modify our algorithm to incorporate these ideas:
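A minimal replay buffer along these lines (the transition format is deliberately left opaque; class and method names are illustrative):

```python
import numpy as np
from collections import deque

# A fixed-size circular replay buffer with mini-batch sampling.
class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop off automatically
        self.rng = np.random.default_rng(seed)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        idx = self.rng.integers(len(self.buffer), size=batch_size)
        return [self.buffer[int(i)] for i in idx]

    def __len__(self):
        return len(self.buffer)
```

Sampling with replacement keeps the implementation simple; the `deque` with `maxlen` gives the circular-overwrite behavior for free.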

This formulation naturally leads to an important concept in deep reinforcement learning: the replay ratio (or data reuse ratio). In our algorithm, for each new transition we collect, we sample a mini-batch of size b from our replay buffer and perform one update. This means we’re reusing past experiences at a ratio of b:1 - for every new piece of data, we’re learning from b experiences. This ratio can be tuned as a hyperparameter. Higher ratios mean more computation per environment step but better data efficiency, as we’re extracting more learning from each collected transition. This highlights one of the key benefits of experience replay: it allows us to decouple the rate of data collection from the rate of learning updates. Some modern algorithms like SAC or TD3 explicitly tune this ratio, sometimes using multiple gradient steps per environment step to achieve higher data efficiency.


Double-Q Network Variant

As we saw earlier, the max operator in the target computation can lead to overestimation of Q-values. This happens because we use the same network to both select and evaluate actions in the target computation: $y_i \leftarrow r_i + \gamma \max_{a' \in A} q(s_i',a'; \boldsymbol{\theta}_{target})$. The max operator means we're both choosing the action that looks best under our current estimates and then using that same set of estimates to evaluate how good that action is, potentially compounding any optimization bias.

Double DQN Van Hasselt et al. (2016) addresses this by using the current network parameters to select actions but the target network parameters to evaluate them. This leads to a simple modification of the DQN algorithm:
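The two target computations can be contrasted side by side; the vectorized layout below (one row of next-state Q-values per transition) is an assumption of this sketch:

```python
import numpy as np

# DQN vs Double DQN target computation for a batch of transitions.
def dqn_target(r, q_next_target, gamma=0.99):
    # select AND evaluate the next action with the target network
    return r + gamma * q_next_target.max(axis=1)

def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99):
    # select with the online network, evaluate with the target network
    a_star = q_next_online.argmax(axis=1)
    return r + gamma * q_next_target[np.arange(len(r)), a_star]
```

When the two networks disagree about which action is best, the Double DQN target evaluates the online network's choice rather than the target network's own maximum, which is what curbs the overestimation.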

The main difference from the original DQN is in step 7, where we now separate action selection from action evaluation. Rather than directly taking the max over the target network's Q-values, we first select the action using our current network ($\boldsymbol{\theta}_n$) and then evaluate that specific action using the target network ($\boldsymbol{\theta}_{target}$). This simple change has been shown to lead to more stable learning and better final performance across a range of tasks.

Deep Q Networks with Resets (2022)

In flattening neural fitted Q-iteration, our field had perhaps lost sight of an important structural element: the choice of inner-loop initializer inherent in the original FQI algorithm. The traditional structure explicitly separated outer iterations (computing targets) from inner optimization (fitting to those targets), with each inner optimization starting fresh from parameters $\boldsymbol{\theta}_0$.

The flattened version with persistent warmstarting seemed like a natural optimization - why throw away learned parameters? However, recent work D'Oro et al. (2023) has shown that persistent warmstarting can actually be detrimental to learning. Neural networks tend to lose their ability to learn and generalize over the course of training, suggesting that occasionally starting fresh from $\boldsymbol{\theta}_0$ might be beneficial. Here's how this looks algorithmically in the context of DQN:

This algorithm change allows us to push the limits of our update ratio - the number of gradient steps we perform per environment interaction. Without resets, increasing this ratio leads to diminishing returns as the network’s ability to learn degrades. However, by periodically resetting the parameters while maintaining our dataset of transitions, we can perform many more updates per interaction, effectively making our algorithm more “offline” and thus more sample efficient.

The hard reset strategy, while effective, might be too aggressive in some settings as it completely discards learned parameters. An alternative approach is to use a softer form of reset, adapting the “Shrink and Perturb” technique originally introduced by Ash & Adams (2020) in the context of continual learning. In their work, they found that neural networks that had been trained on one task could better adapt to new tasks if their parameters were partially reset - interpolated with a fresh initialization - rather than either kept intact or completely reset.

We can adapt this idea to our setting. Instead of completely resetting to $\boldsymbol{\theta}_0$, we can perform a soft reset by interpolating between our current parameters and a fresh random initialization:
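A sketch of the soft reset; the function name and the scale of the fresh initialization are illustrative:

```python
import numpy as np

# "Shrink and Perturb"-style soft reset: interpolate the learned parameters
# with a fresh random initialization (the scale below is illustrative).
def soft_reset(theta, beta, rng):
    theta_fresh = rng.normal(scale=0.1, size=theta.shape)
    return beta * theta + (1.0 - beta) * theta_fresh
```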

The interpolation coefficient $\beta$ controls how much of the learned parameters we retain, with $\beta = 0$ recovering the hard reset case and $\beta = 1$ corresponding to no reset at all. This provides a more flexible approach to restoring learning capability while potentially preserving useful features that have been learned. Like hard resets, this softer variant still enables high update ratios by preventing the degradation of learning capability, but does so in a more gradual way.

Does Parametric Dynamic Programming Converge?

So far we have avoided the discussion of convergence and focused on intuitive algorithm development, showing how we can extend successive approximation by computing only a few operator evaluations which then get generalized over the entire domain at each step of the value iteration procedure. Now we turn our attention to understanding the conditions under which this general idea can be shown to converge.

A crucial question to ask is whether our algorithm maintains the contraction property that made value iteration so appealing in the first place - the property that allowed us to show convergence to a unique fixed point. We must be careful here because the contraction mapping theorem is specific to a given norm. In the case of value iteration, we showed the Bellman optimality operator is a contraction in the sup-norm, which aligns naturally with how we compare policies based on their value functions.

The situation becomes more complicated with fitted methods because we are not dealing with just a single operator. At each iteration, we perform exact, unbiased pointwise evaluations of the Bellman operator, but instead of obtaining the next function exactly, we get the closest representable one under our chosen function approximation scheme. Gordon (1995) showed that the fitting step can be conceptualized as an additional operator applied on top of the exact Bellman operator to produce the next function parameters. This leads to viewing fitted value methods - which for simplicity we describe only for the value case, though the Q-value setting follows similarly - as the composition of two operators:

$$v_{n+1} = \Gamma(\mathrm{L}(v_n))$$

where $\mathrm{L}$ is the Bellman operator and $\Gamma$ represents the function approximation mapping.

Now we arrive at the central question: if $\mathrm{L}$ was a sup-norm contraction, is $\Gamma$ composed with $\mathrm{L}$ still a sup-norm contraction? What conditions must hold for this to be true? This question is fundamental because if we can establish that the composition of these two operators maintains the contraction property in the sup-norm, we get directly that our resulting successive approximation method will converge.

The Search for Nonexpansive Operators

Consider what happens in the fitting step: we have two value functions $v$ and $w$, and after applying the Bellman operator $\mathrm{L}$ to each, we get new target values that differ by at most $\gamma$ times their original difference in sup-norm (due to $\mathrm{L}$ being a $\gamma$-contraction in the sup-norm). But what happens when we fit to these target values? If the function approximator can exaggerate differences between its target values, even a small difference in the targets could lead to a larger difference in the fitted functions. This would be disastrous: even though the Bellman operator shrinks differences between value functions by a factor of $\gamma$, the fitting step could amplify them back up, potentially breaking the contraction property of the composite operator.

In order to ensure that the composite operator is contractive, we need conditions on $\Gamma$ such that if $\mathrm{L}$ is a sup-norm contraction then the composition also is. A natural property to consider is when $\Gamma$ is a non-expansion. By definition, this means that for any functions $v$ and $w$:

$$\|\Gamma(v) - \Gamma(w)\|_\infty \leq \|v - w\|_\infty$$

This turns out to be exactly what we need, since if $\Gamma$ is a non-expansion, then for any functions $v$ and $w$:

$$\|\Gamma(\mathrm{L}(v)) - \Gamma(\mathrm{L}(w))\|_\infty \leq \|\mathrm{L}(v) - \mathrm{L}(w)\|_\infty \leq \gamma\|v - w\|_\infty$$

The first inequality uses the non-expansion property of $\Gamma$, while the second uses the fact that $\mathrm{L}$ is a $\gamma$-contraction. Together they show that the composite operator $\Gamma \circ \mathrm{L}$ remains a $\gamma$-contraction.

Gordon’s Averagers

But which function approximators satisfy this non-expansion property? Gordon shows that “averagers”, approximators that compute their outputs as weighted averages of their training values, are always non-expansions in sup-norm. This includes many common approximation schemes like k-nearest neighbors, linear interpolation, and kernel smoothing with normalized weights. The intuition is that if you’re taking weighted averages with weights that sum to one, you can never extrapolate beyond the range of your training values. These methods “interpolate”. This theoretical framework explains why simple interpolation methods like k-nearest neighbors have proven very stable in practice, while more complex approximators can fail catastrophically. To guarantee convergence, we should either use averagers directly or modify other approximators to ensure they never extrapolate beyond their training targets.

More precisely, a function approximator $\Gamma$ is an averager if for any state $s$ and any target function $v$, the fitted value can be written as:

$$\Gamma(v)(s) = \sum_{i=1}^n w_i(s) v(s_i)$$

where the weights $w_i(s)$ satisfy:

  1. $w_i(s) \geq 0$ for all $i$ and $s$

  2. $\sum_{i=1}^n w_i(s) = 1$ for all $s$

  3. The weights $w_i(s)$ depend only on $s$ and the training points $\{s_i\}$, not on the values $v(s_i)$

Let $m = \min_i v(s_i)$ and $M = \max_i v(s_i)$. Then:

$$m = m\sum_i w_i(s) \leq \sum_i w_i(s)v(s_i) \leq M\sum_i w_i(s) = M$$

So $\Gamma(v)(s) \in [m,M]$ for all $s$, meaning the fitted function cannot take values outside the range of its training values. This property is what makes averagers "interpolate" rather than "extrapolate" and is directly related to why they preserve the contraction property when composed with the Bellman operator. To see why averagers are non-expansions, consider two functions $v$ and $w$. At any state $s$:

$$\begin{align*} |\Gamma(v)(s) - \Gamma(w)(s)| &= \left|\sum_{i=1}^n w_i(s)v(s_i) - \sum_{i=1}^n w_i(s)w(s_i)\right| \\ &= \left|\sum_{i=1}^n w_i(s)(v(s_i) - w(s_i))\right| \\ &\leq \sum_{i=1}^n w_i(s)|v(s_i) - w(s_i)| \\ &\leq \|v - w\|_\infty \sum_{i=1}^n w_i(s) \\ &= \|v - w\|_\infty \end{align*}$$

Since this holds for all $s$, we have $\|\Gamma(v) - \Gamma(w)\|_\infty \leq \|v - w\|_\infty$, proving that $\Gamma$ is a non-expansion.
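We can check the non-expansion property numerically for a small k-NN averager; the training points and target values below are arbitrary illustrative choices:

```python
import numpy as np

# Numeric check of the non-expansion property for a k-NN averager (k=2) on
# arbitrary illustrative 1-D training data.
def knn_fit(train_x, train_y, x, k=2):
    idx = np.argsort(np.abs(train_x - x))[:k]
    return train_y[idx].mean()            # weights 1/k: an averager

train_x = np.array([0.0, 1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 0.0, 2.0])
w = np.array([0.5, 0.0, 0.0, 1.0])
queries = np.linspace(-1.0, 4.0, 51)
gap_fitted = max(abs(knn_fit(train_x, v, x) - knn_fit(train_x, w, x)) for x in queries)
gap_targets = np.max(np.abs(v - w))
# non-expansion: the fitted gap never exceeds the sup-norm gap of the targets
```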

Which Function Approximators Interpolate vs Extrapolate?

K-nearest neighbors (KNN)

Let's look at specific examples, starting with k-nearest neighbors. For any state $s$, let $s_{(1)}, \ldots, s_{(k)}$ denote the $k$ nearest training points to $s$. Then:

$$\Gamma(v)(s) = \frac{1}{k}\sum_{i=1}^k v(s_{(i)})$$

This is an averager with weights $w_i(s) = \frac{1}{k}$ for the $k$ nearest neighbors and 0 for all other points.

Kernel Smoothing

For kernel smoothing with a kernel function $K$, the fitted value is:

$$\Gamma(v)(s) = \frac{\sum_{i=1}^n K(s - s_i)v(s_i)}{\sum_{i=1}^n K(s - s_i)}$$

The denominator normalizes the weights to sum to 1, making this an averager with weights $w_i(s) = \frac{K(s - s_i)}{\sum_{j=1}^n K(s - s_j)}$.

Linear Regression

In contrast, methods like linear regression and neural networks can and often do extrapolate beyond their training targets. More precisely, given a dataset of state-value pairs $\{(s_i, v(s_i))\}_{i=1}^n$, these methods fit parameters to minimize some error criterion, and the resulting function $\Gamma(v)(s)$ may take values outside the interval $[\min_i v(s_i), \max_i v(s_i)]$ even when evaluated at a new state $s$. For instance, linear regression finds parameters by minimizing squared error:

$$\min_{\theta} \sum_{i=1}^n (v(s_i) - \theta^T\phi(s_i))^2$$

The resulting fitted function is:

$$\Gamma(v)(s) = \phi(s)^T(\Phi^T\Phi)^{-1}\Phi^T v$$

where $\Phi$ is the feature matrix with rows $\phi(s_i)^T$. This cannot be written as a weighted average with weights independent of $v$. Indeed, we can construct examples where fitting expands the sup-norm difference between two sets of targets. For example, consider two sets of target values defined on just three points $s_1 = 0$, $s_2 = 1$, and $s_3 = 2$:

$$v = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}, \quad w = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$

Using a single feature $\phi(s) = s$, our feature matrix is:

$$\Phi = \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix}$$

so that $\Phi^T\Phi = 5$. For function $v$, the fitted parameter is:

$$\theta_v = (\Phi^T\Phi)^{-1}\Phi^T v = \frac{3}{5}$$

And for function $w$:

$$\theta_w = (\Phi^T\Phi)^{-1}\Phi^T w = 0$$

Now if we evaluate these fitted functions at $s = 3$ (outside our training points):

$$\Gamma(v)(3) = 3\theta_v = \frac{9}{5} = 1.8$$
$$\Gamma(w)(3) = 3\theta_w = 0$$

Therefore:

$$|\Gamma(v)(3) - \Gamma(w)(3)| = \frac{9}{5} > 1 = \|v - w\|_\infty$$
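This expansion can be checked numerically, using one illustrative choice of targets on three training points:

```python
import numpy as np

# Numeric check that least-squares fitting can expand sup-norm differences,
# using one illustrative choice of targets on three training points.
s = np.array([0.0, 1.0, 2.0])
Phi = s.reshape(-1, 1)                    # single feature phi(s) = s
v = np.array([0.0, 1.0, 1.0])
w = np.array([0.0, 0.0, 0.0])

def ols_predict(targets, s_query):
    theta = np.linalg.lstsq(Phi, targets, rcond=None)[0]
    return s_query * theta[0]

gap_targets = np.max(np.abs(v - w))       # sup-norm gap of the targets
gap_fitted = abs(ols_predict(v, 3.0) - ols_predict(w, 3.0))
# gap_fitted exceeds gap_targets: least squares is not a non-expansion
```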

Spline Interpolation

Linear interpolation between points -- the technique used earlier in this chapter -- is an averager since for any point $s$ between knots $s_i$ and $s_{i+1}$:

\Gamma(v)(s) = \left(\frac{s_{i+1}-s}{s_{i+1}-s_i}\right)v(s_i) + \left(\frac{s-s_i}{s_{i+1}-s_i}\right)v(s_{i+1})

The weights sum to 1 and are non-negative, so linear interpolation is a non-expansion. Cubic splines, despite their smoothness advantages, can violate this property. To see this, consider fitting a natural cubic spline to three points:

s_1 = 0,\; s_2 = 1,\; s_3 = 2

with two different sets of values:

v = \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}, \quad w = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

With natural boundary conditions, the second derivatives at the endpoints vanish and the interior second derivative is $M_1 = \tfrac{3}{2}(v_1 - 2v_2 + v_3) = -\tfrac{9}{2}$. On the segment $[1, 2]$ the spline works out to

\Gamma(v)(s) = -\tfrac{3}{4}(2-s)^3 + \tfrac{7}{4}(2-s)

which rises above its endpoint values and peaks at $s = 2 - \sqrt{7}/3 \approx 1.12$ with value $\tfrac{7\sqrt{7}}{18} \approx 1.03$: in bending to keep curvature low, the spline overshoots the largest training value. Meanwhile, $w$ fits a flat line at 0. Therefore

\|v - w\|_\infty = 1

but

\|\Gamma(v) - \Gamma(w)\|_\infty \approx 1.03 > 1 = \|v - w\|_\infty

This illustrates a general principle: methods that try to create smooth functions by minimizing some global criterion (like curvature in splines) often sacrifice the non-expansion property to achieve their smoothness goals.
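Such violations can be checked numerically. The sketch below uses NumPy only: with three equally spaced knots and natural boundary conditions, the spline has a single unknown interior second derivative $M_1$, so it can be written in closed form (the helper `natural_spline_3pt` is illustrative, not from the text). It evaluates the spline of a data vector with sup norm 1 on a fine grid and measures the overshoot:

```python
import numpy as np

def natural_spline_3pt(v, s):
    """Natural cubic spline through (0, v[0]), (1, v[1]), (2, v[2]).
    With endpoint second derivatives M0 = M2 = 0 and unit spacing,
    the interior second derivative is M1 = (3/2)(v0 - 2*v1 + v2)."""
    v0, v1, v2 = v
    M1 = 1.5 * (v0 - 2.0 * v1 + v2)
    s = np.asarray(s, dtype=float)
    # Standard piecewise-cubic form on [0, 1] and [1, 2]
    left = M1 * s**3 / 6 + v0 * (1 - s) + (v1 - M1 / 6) * s
    right = M1 * (2 - s)**3 / 6 + (v1 - M1 / 6) * (2 - s) + v2 * (s - 1)
    return np.where(s <= 1, left, right)

grid = np.linspace(0.0, 2.0, 2001)
v = (-1.0, 1.0, 0.0)   # sup-norm distance 1 from the zero function
overshoot = np.abs(natural_spline_3pt(v, grid)).max()
print(overshoot)       # ~1.029, i.e. 7*sqrt(7)/18 > 1
```

Since the spline of the zero vector is identically zero, `overshoot` is exactly $\|\Gamma(v) - \Gamma(w)\|_\infty$, and its exceeding 1 confirms the expansion.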

References
  1. Bertsekas, D. P. (1983). Distributed asynchronous computation of fixed points. Mathematical Programming, 27(1), 107–120. 10.1007/bf02591967
  2. Kortum, S. (1992). Value Function Approximation in an Estimation Routine.
  3. Rust, J. (1996). Chapter 14 Numerical dynamic programming in economics. In Handbook of Computational Economics (pp. 619–729). Elsevier. 10.1016/s1574-0021(96)01016-7
  4. Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6, 503–556. https://jmlr.org/papers/v6/ernst05a.html
  5. Riedmiller, M. (2005). Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method. Proceedings of the 16th European Conference on Machine Learning (ECML), 317–328. 10.1007/11564096_32
  6. Ormoneit, D., & Sen, Ś. (2002). Kernel-Based Reinforcement Learning. Machine Learning, 49(2/3), 161–178. 10.1023/a:1017928328829
  7. Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42. 10.1007/s10994-006-6226-1
  8. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., & others. (2013). Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop.
  9. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous Control with Deep Reinforcement Learning. arXiv Preprint arXiv:1509.02971.
  10. Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. International Conference on Machine Learning (ICML), 1587–1596.
  11. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning (ICML), 1861–1870.
  12. Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning, and teaching [PhD thesis]. Carnegie Mellon University.
  13. Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1).